The data is extracted from 1994 US census database and was found at the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Adult
We will try to analyze how different sociodemographical indicators affect the likelihood of a person earning more than 50,000$ a year.
The comments about the chunk are given before the chunk.
First, let’s import the dataset and format it a bit for easier exploration.
Initialize, read file, assign column names.
Change actual NA values to proper NA object, and drop unused levels.
library(ggplot2)
library(dplyr)
library(reshape2)
library(scales)
adult <- read.csv('~/DataAnalyst/Projects/DataAnalystND_Project_4/adult/adult.data', header = F)
names(adult) <- c('age','workclass','fnlwgt','education','education_num',
'marital_status','occupation','relationship','race','sex',
'capital_gain','capital_loss','hours_per_week','native_country','income')
levels(adult$income) <- c('low','high')
adult[adult ==' ?'] <- NA
adult <- droplevels(adult)
#This addition to ggplot plots will set alpha of the legend to 100% for better readability.
fix_alpha <- guides(colour = guide_legend(override.aes = list(alpha = 1)))
Arrange education levels by the provided ‘education_num’ variable.
Arrange other factors by frequency of high salary(High Salary Ratio, HSR) - from lowest to highest.
Remove unneeded columns.
Group very low frequency workclass levels together.
Remove spaces from factor levels.
I’ll refer to high salary ratio of a group (number of people having high income divided by group size) as HSR.
We’ve stored HSR’s for variable levels for each variable in the adult_by[[variable]] list.
Let’s look how each variable affects the income:
age: HSR increases from 16 to 50, then declines. For some reason, ages 79 and 83 have very high HSR.
workclass: By far most people work in Private sector which also has the lowest HSR. Being Self-emp-inc (probably company owners) is paid very good, working for federal government is also paid well.
education: having high education definitely increases income.
marital_status: married-civ-spouse and married-af-spouse have the highest HSR.
About 1/3 of the respondents were never married, and they have the lowest HSR.
occupation: About 25% of the data are Prof-speciality and Exec-managerial - two highest HSR categories.
Priv-house-serv occupation has HSR of only 0.6%.
relationship: Wives and Husbands have very high HSR. Wives have even higher HSR than Husbands, despite that women have HSR of 11% and men of 30.5%.
race: Black and Native American have half as high HSR than White and Asian.
sex: For some reason, there are twice as many men as women in the survey, and for men the HSR is 3 times higher than HSR for women.
hours per week: 0 to 25 hpw - HSR decreases a bit, 25-60 - increases, and then decreases, probably because the top-paid executives don’t work long hours.
native_country: Caribbean and Latin-American have the lowest HSR. US-natives are somewhere in the middle, and the top of the list cannot be attributed to some region in particualar.
90% of the people are US-natives.
for (name in names(adult[,-11])) {
print(adult_by[[name]])
}
## Source: local data frame [73 x 3]
##
## age high_salary_ratio n
## (int) (dbl) (int)
## 1 17 0.000000000 395
## 2 18 0.000000000 550
## 3 20 0.000000000 753
## 4 82 0.000000000 12
## 5 85 0.000000000 3
## 6 86 0.000000000 1
## 7 87 0.000000000 1
## 8 88 0.000000000 3
## 9 19 0.002808989 712
## 10 21 0.004166667 720
## .. ... ... ...
## Source: local data frame [8 x 3]
##
## workclass high_salary_ratio n
## (fctr) (dbl) (int)
## 1 No_pay 0.0000000 21
## 2 NA 0.1040305 1836
## 3 Private 0.2186729 22696
## 4 State-gov 0.2719569 1298
## 5 Self-emp-not-inc 0.2849272 2541
## 6 Local-gov 0.2947922 2093
## 7 Federal-gov 0.3864583 960
## 8 Self-emp-inc 0.5573477 1116
## Source: local data frame [16 x 3]
##
## education high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Preschool 0.00000000 51
## 2 1st-4th 0.03571429 168
## 3 5th-6th 0.04804805 333
## 4 11th 0.05106383 1175
## 5 9th 0.05252918 514
## 6 7th-8th 0.06191950 646
## 7 10th 0.06645230 933
## 8 12th 0.07621247 433
## 9 HS-grad 0.15950862 10501
## 10 Some-college 0.19023454 7291
## 11 Assoc-acdm 0.24835989 1067
## 12 Assoc-voc 0.26121563 1382
## 13 Bachelors 0.41475257 5355
## 14 Masters 0.55658735 1723
## 15 Prof-school 0.73437500 576
## 16 Doctorate 0.74092010 413
## Source: local data frame [7 x 3]
##
## marital_status high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Never-married 0.04596087 10683
## 2 Separated 0.06439024 1025
## 3 Married-spouse-absent 0.08133971 418
## 4 Widowed 0.08559919 993
## 5 Divorced 0.10420887 4443
## 6 Married-AF-spouse 0.43478261 23
## 7 Married-civ-spouse 0.44684829 14976
## Source: local data frame [15 x 3]
##
## occupation high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Priv-house-serv 0.006711409 149
## 2 Other-service 0.041578149 3295
## 3 Handlers-cleaners 0.062773723 1370
## 4 NA 0.103635377 1843
## 5 Armed-Forces 0.111111111 9
## 6 Farming-fishing 0.115694165 994
## 7 Machine-op-inspct 0.124875125 2002
## 8 Adm-clerical 0.134482759 3770
## 9 Transport-moving 0.200375704 1597
## 10 Craft-repair 0.226640644 4099
## 11 Sales 0.269315068 3650
## 12 Tech-support 0.304956897 928
## 13 Protective-serv 0.325115562 649
## 14 Prof-specialty 0.449033816 4140
## 15 Exec-managerial 0.484013773 4066
## Source: local data frame [6 x 3]
##
## relationship high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Own-child 0.01322021 5068
## 2 Other-relative 0.03771662 981
## 3 Unmarried 0.06326175 3446
## 4 Not-in-family 0.10307044 8305
## 5 Husband 0.44857121 13193
## 6 Wife 0.47512755 1568
## Source: local data frame [5 x 3]
##
## race high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Other 0.09225092 271
## 2 Amer-Indian-Eskimo 0.11575563 311
## 3 Black 0.12387964 3124
## 4 White 0.25585994 27816
## 5 Asian-Pac-Islander 0.26564004 1039
## Source: local data frame [2 x 3]
##
## sex high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Female 0.1094606 10771
## 2 Male 0.3057366 21790
## Source: local data frame [94 x 3]
##
## hours_per_week high_salary_ratio n
## (int) (dbl) (int)
## 1 11 0 11
## 2 19 0 14
## 3 23 0 21
## 4 31 0 5
## 5 74 0 1
## 6 77 0 6
## 7 81 0 3
## 8 82 0 1
## 9 86 0 2
## 10 87 0 1
## .. ... ... ...
## Source: local data frame [42 x 3]
##
## native_country high_salary_ratio n
## (fctr) (dbl) (int)
## 1 Holand-Netherlands 0.00000000 1
## 2 Outlying-US(Guam-USVI-etc) 0.00000000 14
## 3 Dominican-Republic 0.02857143 70
## 4 Columbia 0.03389831 59
## 5 Guatemala 0.04687500 64
## 6 Mexico 0.05132193 643
## 7 Nicaragua 0.05882353 34
## 8 Peru 0.06451613 31
## 9 Vietnam 0.07462687 67
## 10 Honduras 0.07692308 13
## .. ... ... ...
We’ve talked about age, but the relationship is seen better in the graph.
ggplot(data = adult_by[['age']], aes(x = age, y = high_salary_ratio)) +
geom_line()
Let’s break the age-hsr by education and sex.
adult_by_age_sex<- adult %>%
group_by(age, sex) %>%
summarise(high_salary_ratio = sum(income == 'high')/n(),
n = n()) %>%
arrange(age, sex)
ggplot(adult_by_age_sex, aes(x = age, y = high_salary_ratio, color = sex)) +
geom_line()
adult_by_age_education<- adult %>%
group_by(age, education) %>%
summarise(high_salary_ratio = sum(income == 'high')/n(),
n = n()) %>%
arrange(age, education)
ggplot(adult_by_age_education, aes(x = age, y = high_salary_ratio, color = education)) +
geom_line(size = 2)
ggplot(adult_by_age_education, aes(x = age, y = high_salary_ratio, color = education)) +
geom_point(size = 5)
Women are younger than men.
ggplot(adult, aes(x = sex, y = age, fill = sex)) +
geom_violin()
As we’ve seen, high-income people are older.
ggplot(adult, aes(x = income, y = age, fill = income)) +
geom_violin()
The HPW distribution looks somewhat normal. Hard to say, but looks like most of the high income comes from high HPW.
ggplot(data = adult, aes(x = hours_per_week, fill = income)) +
geom_histogram(binwidth = 2) +
coord_cartesian(ylim = c(0,3000))
ggplot(data = adult, aes(x = hours_per_week, fill = income)) +
geom_histogram(binwidth = 2, position = 'dodge') +
coord_cartesian(ylim = c(0,3000))
Working about 60 hpw is paid best, working more - a little less.
It’s interesting that the slope of the curve around the 60 hpw is similar on both sides.
ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = high_salary_ratio)) +
geom_line() +
geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Let’s see how much people with different education work. There’s a definite curve in the mean hpw (red diamonds).
ggplot(adult, aes(x = education, y = hours_per_week, fill = education)) +
geom_boxplot() +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3, show_guide = FALSE)
Most of the people are HS-grads, Some-college or Barchelors. Nothing particularly interesting about the races.
ggplot(data = adult, aes(x = education, fill = race)) +
geom_histogram()
HSR for men with high education is very high.
ggplot(data = adult, aes(x = education, fill = income)) +
geom_histogram() +
facet_wrap(~sex, ncol = 1)
As we’ve already seen, people with better education are older. No surprise here.
People who have have average education have very high age difference between high and low income.
ggplot(data = adult, aes(x = education, y = age, fill = income)) +
geom_boxplot()
We see that people of the lower education categories (from preschool to about 9th grade) are older than the subsequent categories.
That is probably because these people are most likely dropouts (and they could drop out a while ago), while people with 10th grade education or higher could be still studying (we have respondents of age 16+ in the survey).
Men with lower education are younger than women with same education, and the opposite is true for higher education.
ggplot(data = adult, aes(x = education, y = age, fill = sex)) +
geom_boxplot()
High-income people on average work 6.6 more hours_per_week than low-income.
by(adult$hours_per_week,adult$income, mean)
## adult$income: low
## [1] 38.84021
## --------------------------------------------------------
## adult$income: high
## [1] 45.47303
ggplot(data = adult, aes(x = income, y = hours_per_week)) +
geom_boxplot() +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE)
High-income people are on average 7.5 years older.
by(adult$age,adult$income, mean)
## adult$income: low
## [1] 36.78374
## --------------------------------------------------------
## adult$income: high
## [1] 44.24984
ggplot(data = adult, aes(x = income, y = age)) +
geom_boxplot() +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE)
Most of the husbands and wives(highest HSR relationships) are married-civ-spouse(highest HSR marital status).
ggplot(adult, aes(x = relationship, fill = marital_status)) +
geom_histogram()
ggplot(adult, aes(x = marital_status, fill = relationship)) +
geom_histogram()
As people become older, they tend to either marry or become widowed and divorce less:
ggplot(data = adult, aes(x = age, fill = marital_status)) +
geom_histogram(binwidth = 1)
Most of the younger people are children.
As we’ve seen, number of husbands is much higher than number of wives.
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1)
Let’s break it by sexes.
There are way more married men than women.
The histogram for women is much more skewed to the left than for men.
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1) +
facet_wrap(~sex, ncol = 1)
We see that in Own-child and Husband/Wife relationships women are older, and in other relationships women are younger.
ggplot(data = adult, aes(x = relationship, y = age, fill = sex)) +
geom_boxplot()
We have 2 male wives and 1 female husband. These are probably errors.
nrow(subset(adult, sex == 'Male' & relationship == 'Wife'))
## [1] 2
nrow(subset(adult, sex == 'Female' & relationship == 'Husband'))
## [1] 1
Older people and women work less.
ggplot(data = adult, aes(x = age, y = hours_per_week, color = sex)) +
geom_jitter(alpha = 0.5) +
fix_alpha
The majority of high-income people come from 2 highest-paid occupations.
ggplot(data = adult, aes(x = occupation, fill = income)) +
geom_histogram()
ggplot(data = adult, aes(x = occupation, fill = income)) +
geom_histogram(position = 'dodge')
The people in self-employed workclasses are mostly men.
NA category has very high women/men ratio.
ggplot(data = adult, aes(x = workclass, fill = sex)) +
geom_bar(position = 'dodge')
The majority of the difference in male-female populations is due to ‘white’ race.
Black race has very high woman ratio.
ggplot(data = adult, aes(x = race, fill = sex)) +
geom_histogram(position = 'dodge')
Let’s see which occupations are dominated by either sex.
Remeber, that the occupations are arranged by HSR.
The highest female ratio is in the lowest-paid occupation(‘Priv-house-serv’).
Other female occupations: Adm-clerical and Other-service.
Male occupation: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
Pretty much as expected.
occupation_by_sex = adult %>%
group_by(occupation) %>%
summarise(female_ratio = sum(sex == 'Female')/n(),
n = n())
ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
geom_bar(stat='identity')
The countries are ordered by HSR.
In both high-HSR and low-HSR countries there are countries with many men(Mexico, India) or women(Dominican Republic, Germany), and countries with more younger(Guatemala, Taiwan) or older(Puerto-Rico, Italy) people.
Thus, there is not much new information here.
ggplot(data = adult, aes(x = age, fill = sex)) +
geom_histogram(binwidth = 2) +
facet_wrap(~native_country, scales = 'free_y')
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1) +
facet_wrap(~sex, ncol = 1) +
xlab('Age, years') +
ylab('Number of Respondents') +
ggtitle('Histogram of Ages, by Sex and relationship')
The distribution of women looks very different to the distribution of men.
There are less women than men and they are younger.
The histograms of all the relationships for men and women, except husband and wife, look the same.
ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = high_salary_ratio)) +
geom_line() +
geom_smooth() +
xlab('Hours per week working') +
ylab('High salary ratio (frequency of high-income people)') +
ggtitle('Relation of income to working hours per week')
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
The lowest HSR is at 0-25 hpw, and after that HSR rises up to about 60 hpw.
But after 60 hpw the average hsr decreases.
We see that people who work about 100 hours a week earn about the same ass people who work 40 hours.
ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
geom_bar(stat='identity') +
scale_y_continuous(labels = percent) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab('Percentage of females in occupation') +
xlab('Occupation') +
ggtitle('Occupations by gender')
Here we can see which professions are dominated by either gender.
The occupations are ordered by HSR (leftmost occupation is worst-paid).
The occupations that have more women than others: Priv-House-Serv(lowest-paid occupation), Adm-clerical and Other-service.
Male occupations: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
The dataset has very interesting information that characterizes the respondents in different sociodemographical ways. Nevertheless, we can say with fair amount of certainty, that the dataset is not representative of the US population, for example by gender and race distributions.
In the dataset, women and black/native americans are paid much less than men and white/asian people. This does not necessarily imply discrimination, but is suspicious.
People originating from different countries differ in income, and most of the low-income countries are in Latin America. Most of the high-income countries are rich and developed, although it is surprising to see Cambodia, Yugoslavia and even Iran on the top of the list. Anyway, 90% of the people are US-natives, so this variable doesn’t give too much information.
The dataset mostly contains categorical variables, having more numeric variables would be interesting.
Mainly, if our dependent variable (income) was numeric, it would open more opportunities for exploration.
It would be also very interesting to have the same data for a different year to examine the trends.